17 research outputs found

    Robust Estimators are Hard to Compute

    In modern statistics, the robust estimation of parameters of a regression hyperplane is a central problem. Robustness means that the estimation is not affected, or only slightly affected, by outliers in the data. In this paper, it is shown that the following robust estimators are hard to compute: LMS, LQS, LTS, LTA, MCD, MVE, Constrained M estimator, Projection Depth (PD) and Stahel-Donoho. In addition, a data set is presented such that the ltsReg-procedure of R has probability less than 0.0001 of finding a correct answer. Furthermore, the paper describes how to design new robust estimators. --Computational statistics,complexity theory,robust statistics,algorithms,search heuristics

    Repeated median and hybrid filters

    Standard median filters preserve abrupt shifts (edges) and remove impulsive noise (outliers) from a constant signal but they deteriorate in trend periods. FIR median hybrid (FMH) filters are more flexible and also preserve shifts, but they are much more vulnerable to outliers. Application of robust regression methods, in particular of the repeated median, has been suggested for removing subsequent outliers from a signal with trends. A fast algorithm for updating the repeated median in linear time using quadratic space is given in Bernholt and Fried (2003). We construct repeated median hybrid filters to combine the robustness properties of the repeated median with the edge preservation ability of FMH filters. An algorithm for updating the repeated median is presented which needs only linear space. We also investigate analytical properties of these filters and compare their performance via simulations. --Signal extraction,Drifts,Jumps,Outliers,Update algorithm
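
The repeated median mentioned in this abstract can be illustrated with a minimal sketch. It is a naive quadratic-time version applied to each window from scratch, not the update algorithm of the paper; the function names and the choice of equally spaced time points 0..n-1 are assumptions made for illustration.

```python
from statistics import median


def repeated_median_fit(y):
    """Naive repeated median line fit for values y observed at times 0..n-1.

    Slope: median over i of the median over j != i of pairwise slopes.
    Intercept: median of y[i] - slope * i.
    Returns (intercept, slope).
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    inner_medians = []
    for i in range(n):
        # Median of the slopes of all lines through point i and another point.
        inner_medians.append(
            median((y[j] - y[i]) / (j - i) for j in range(n) if j != i)
        )
    slope = median(inner_medians)
    intercept = median(y[i] - slope * i for i in range(n))
    return intercept, slope


def repeated_median_filter(signal, k):
    """Moving-window filter of width 2k+1 (k is a hypothetical parameter).

    For each full window, the fitted line is evaluated at the window centre.
    """
    out = []
    for t in range(k, len(signal) - k):
        window = signal[t - k:t + k + 1]
        intercept, slope = repeated_median_fit(window)
        out.append(intercept + slope * k)
    return out
```

On a linear trend with a single spike, such as `[1, 3, 5, 100, 9, 11, 13]`, the fit ignores the spike and recovers the underlying line, which is the behaviour in trend periods that the abstract contrasts with the running median.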

    Modified repeated median filters

    We discuss moving window techniques for fast extraction of a signal comprising monotonic trends and abrupt shifts from a noisy time series with irrelevant spikes. Running medians remove spikes and preserve shifts, but they deteriorate in trend periods. Modified trimmed mean filters use a robust scale estimate such as the median absolute deviation about the median (MAD) to select an adaptive amount of trimming. Application of robust regression, particularly of the repeated median, has been suggested for improving upon the median in trend periods. We combine these ideas and construct modified filters based on the repeated median offering better shift preservation. All these filters are compared w.r.t. fundamental analytical properties and in basic data situations. An algorithm for the update of the MAD running in time O(log n) for window width n is presented as well. --signal extraction,robust filtering,drifts,jumps,outliers,computational geometry,update algorithm
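
As a rough sketch of how a MAD-based trimming rule of this kind might look, the following computes the MAD of a window from scratch and averages only the observations close to the median. The threshold multiplier `q` and the function names are assumptions for illustration, and the O(log n) update algorithm of the paper is not reproduced here.

```python
from statistics import mean, median


def mad(window):
    """Median absolute deviation about the median (no consistency factor)."""
    m = median(window)
    return median(abs(x - m) for x in window)


def modified_trimmed_mean(window, q=2.0):
    """Average the observations within q * MAD of the window median.

    q is a hypothetical tuning constant; observations further away are
    treated as spikes and trimmed.
    """
    m = median(window)
    scale = mad(window)
    kept = [x for x in window if abs(x - m) <= q * scale]
    # If nothing survives the trimming (possible when MAD is 0 and the
    # median is not an observed value), fall back to the median itself.
    return mean(kept) if kept else m
```

For the window `[1, 2, 3, 4, 100]`, the median is 3 and the MAD is 1, so with `q = 2.0` the value 100 is trimmed and the remaining four observations are averaged.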

    Computing the Least Quartile Difference Estimator in the Plane

    A common problem in linear regression is that largely aberrant values can strongly influence the results. The least quartile difference (LQD) regression estimator is highly robust, since it can resist up to almost 50% largely deviant data values without becoming extremely biased. Additionally, it shows good behavior on Gaussian data, in contrast to many other robust regression methods. However, the LQD is not widely used yet due to the high computational effort needed when using common algorithms, e.g. the subset algorithm of Rousseeuw and Leroy. For computing the LQD estimator for n data points in the plane, we propose a randomized algorithm with expected running time O(n^2 log^2 n) and an approximation algorithm with a running time of roughly O(n^2 log n). It can be expected that the practical relevance of the LQD estimator will strongly increase thereby.

    Constrained Minkowski Sums: A Geometric Framework for Solving Interval Problems in Computational Biology Efficiently

    In this paper, we introduce the notion of a constrained Minkowski sum: for two (finite) point-sets P, Q ⊆ ℝ² and a set of k inequalities Ax ≥ b, it is defined as the point-set (P ⊕ Q)_{Ax ≥ b} = {x = p + q ∣ p ∈ P, q ∈ Q, Ax ≥ b}. We show that typical interval problems from computational biology can be solved by computing a set containing the vertices of the convex hull of an appropriately constrained Minkowski sum. We provide an algorithm for computing such a set with running time O(N log N), where N = |P| + |Q|, if k is fixed. For the special case (P ⊕ Q)_{x_1 ≥ β}, where P and Q consist of points with integer x_1-coordinates whose absolute values are bounded by O(N), we even achieve a linear running time O(N). We thereby obtain a linear running time for many interval problems from the literature and improve upon the best known running times for some of them. The main advantage of the presented approach is that it provides a general framework within which a broad variety of interval problems can be modeled and solved
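
The definition of the constrained Minkowski sum can be made concrete with a brute-force sketch. It enumerates all |P|·|Q| pairs directly from the definition, so it is quadratic and is not the O(N log N) algorithm of the paper; the function name and the representation of A as a list of coefficient pairs are assumptions for illustration.

```python
def constrained_minkowski_sum(P, Q, A, b):
    """Brute-force constrained Minkowski sum in the plane.

    P, Q: iterables of (x1, x2) points.
    A:    list of k coefficient pairs (a1, a2), one per inequality.
    b:    list of k right-hand sides.
    Returns the set of all x = p + q with a1*x1 + a2*x2 >= b_i for every i.
    """
    result = set()
    for p1, p2 in P:
        for q1, q2 in Q:
            x1, x2 = p1 + q1, p2 + q2
            # Keep the sum only if it satisfies every inequality of Ax >= b.
            if all(a1 * x1 + a2 * x2 >= bi for (a1, a2), bi in zip(A, b)):
                result.add((x1, x2))
    return result
```

With `P = [(0, 0), (1, 2)]`, `Q = [(0, 1), (2, 0)]` and the single constraint x_1 ≥ 1 (that is, `A = [(1, 0)]`, `b = [1]`), the unconstrained sum has four points and the constraint removes `(0, 1)`.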

    Detecting high-order interactions of single nucleotide polymorphisms using genetic programming

    Motivation: Not individual single nucleotide polymorphisms (SNPs), but high-order interactions of SNPs are assumed to be responsible for complex diseases such as cancer. Therefore, one of the major goals of genetic association studies concerned with such genotype data is the identification of these high-order interactions. This search is additionally impeded by the fact that these interactions are often explanatory only for a relatively small subgroup of patients. Most of the feature selection methods proposed in the literature, unfortunately, fail at this task, since they can either only identify individual variables or interactions of a low order, or try to find rules that are explanatory for a high percentage of the observations. In this paper, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method, called GPAS (Genetic Programming for Association Studies), can be used not only for feature selection but also for discrimination. Results: In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS is able to identify high-order interactions of SNPs leading to a considerably increased breast cancer risk for different subsets of patients that are not found by other feature selection methods. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising a few dozen SNPs, but can also be employed to analyze whole-genome data.

    Constrained Minkowski Sums: A Geometric Framework for Solving Interval Problems in Computational Biology Efficiently

    In this paper, we introduce the notion of a constrained Minkowski sum: for two (finite) point-sets P, Q ⊆ ℝ² and a set of k inequalities Ax ≥ b, it is defined as the point-set (P ⊕ Q)_{Ax ≥ b} = {x = p + q ∣ p ∈ P, q ∈ Q, Ax ≥ b}. We show that typical interval problems from computational biology can be solved by computing a set containing the vertices of the convex hull of an appropriately constrained Minkowski sum. We provide an algorithm for computing such a set with running time O(N log N), where N = |P| + |Q|, if k is fixed. For the special case (P ⊕ Q)_{x_1 ≥ β}, where P and Q consist of points with integer x_1-coordinates whose absolute values are bounded by O(N), we even achieve a linear running time O(N). We thereby obtain a linear running time for many interval problems from the literature and improve upon the best known running times for some of them. The main advantage of the presented approach is that it provides a general framework within which a broad variety of interval problems can be modeled and solved

    Computing the least median of squares estimator in time O(n^d)

    In modern statistics, the robust estimation of parameters of a regression hyperplane is a central problem, i.e., an estimation that is not or only slightly affected by outliers in the data. In this paper we will consider the least median of squares (LMS) estimator. For n points in d dimensions we describe a randomized algorithm for LMS running in O(n^d) time and O(n) space, for d fixed, and in time O(d^3 · (2n)^d) and O(dn) space, for arbitrary d
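
To make the LMS criterion concrete, here is a minimal sketch for a line in the plane (d = 2). It evaluates the median of squared residuals and searches only over lines through pairs of data points, which is a simple heuristic and not the randomized algorithm of the paper; it is not guaranteed to find the exact LMS optimum. The function names are hypothetical, and the plain sample median is used here, whereas some LMS definitions use a specific order statistic instead.

```python
from itertools import combinations
from statistics import median


def lms_objective(points, intercept, slope):
    """Median of squared residuals of the line y = intercept + slope * x."""
    return median((y - (intercept + slope * x)) ** 2 for x, y in points)


def lms_line_heuristic(points):
    """Search lines through pairs of data points for the smallest objective.

    Heuristic only: the exact LMS line need not pass through two data points.
    Returns (intercept, slope, objective) of the best candidate found.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical candidate line, skipped in this sketch
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        value = lms_objective(points, intercept, slope)
        if best is None or value < best[2]:
            best = (intercept, slope, value)
    if best is None:
        raise ValueError("need two points with distinct x-coordinates")
    return best
```

For seven points of which five lie exactly on y = 2x + 1 and two are gross outliers, the candidate through any two of the good points has objective 0, so the outliers do not influence the returned line.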

    Algorithms, Theory

    In this paper, we introduce the notion of a constrained Minkowski sum which for two (finite) point-sets P, Q ⊆ ℝ² and a set of k inequalities Ax ≥ b is defined as the point-set (P ⊕ Q)_{Ax ≥ b} = {x = p + q | p ∈ P, q ∈ Q, Ax ≥ b}. We show that typical subsequence problems from computational biology can be solved by computing a set containing the vertices of the convex hull of an appropriately constrained Minkowski sum. We provide an algorithm for computing such a set with running time O(N log N), where N = |P| + |Q|, if k is fixed. For the special case (P ⊕ Q)_{x_1 ≥ β}, where P and Q consist of points with integer x_1-coordinates whose absolute values are bounded by O(N), we even achieve a linear running time O(N). We thereby obtain a linear running time for many subsequence problems from the literature and improve upon the best known running times for some of them. The main advantage of the presented approach is that it provides a general framework within which a broad variety of subsequence problems can be modeled and solved. This includes objective functions and constraints which are even more complex than the ones considered before